Structural Equivalence Between Co-occurrences of Characters and Words in the Chinese Language

Authors

Yuming Shi

Wei Liang

Jing Liu

Chi K. Tse

Abstract

Complex networks are constructed to study the co-occurrence of characters and words in the Chinese language. Two types of networks are investigated: in the first, nodes correspond to Chinese characters; in the second, nodes correspond to Chinese words. In both, edges connect characters or words that occur consecutively. The networks are built from a collection of Chinese texts in four different styles, namely essays, novels, popular science articles, and news reports. Their statistical properties are studied in terms of complex network parameters, including average degree, diameter, average path length, clustering coefficient, degree distribution, and connected subnetworks. Although the two kinds of networks have different parameter values, they display qualitatively similar properties, such as small-world and scale-free features. This qualitative equivalence between the network of Chinese characters and the network of Chinese words provides a valid basis on which either type of network can be used to compare different languages, regardless of the incompatibility between the linguistic roles that words play in Chinese and in other languages.
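The construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an undirected, unweighted graph, ignores self-loops from repeated tokens, and takes word segmentation as already done. The function names (`build_network`, `average_degree`, `clustering_coefficient`, `average_path_length`) and the sample phrase with its segmentation are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


def build_network(tokens):
    """Build an undirected co-occurrence graph from a token sequence.

    Each distinct token (character or word) is a node; two tokens are
    linked if they appear consecutively. Self-loops are skipped (an
    assumption of this sketch, not something stated in the abstract).
    """
    adj = defaultdict(set)
    for tok in tokens:
        adj[tok]  # touching the key registers the node, even if isolated
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering_coefficient(adj):
    """Average local clustering; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Each edge among the neighbours is seen twice, hence the division by 2.
        links = sum(1 for u in nbrs for v in adj[u] if v in nbrs) / 2
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Character network: every character of the text is a token.
char_net = build_network(list("天天向上"))
# Word network: tokens come from a (here hand-written) segmentation.
word_net = build_network(["天天", "向上"])

for name, net in [("characters", char_net), ("words", word_net)]:
    print(name, average_degree(net), clustering_coefficient(net),
          average_path_length(net))
```

The same functions serve both network types; only the tokenization differs, which is what allows the two kinds of networks to be compared on the same parameters.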


Similar resources

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into predefined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...


Hybrid Models for Chinese Unknown Word Resolution Dissertation

Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...


Using $k$-way Co-occurrences for Learning Word Embeddings

Co-occurrences between two words provide useful insights into the semantics of those words. Consequently, numerous prior works on word embedding learning have used co-occurrences between two words as the training signal for learning word embeddings. However, in natural language texts it is common for multiple words to be related and to co-occur in the same context. We extend the notion of co-oc...


“Those Nation Wreckers are Suffering from Inferiority Complex”: The Depiction of Chinese Miners in the Ghanaian Press

This article studies the depiction of Chinese miners in the Ghanaian news website entitled Modern Ghana. A total of 87 articles comprising 43752 words were retrieved. Van Leeuwen’s (2008) theory of the representation of the social actors was utilised to examine the depiction of Chinese miners in the Ghanaian press. In this regard, six applicable tools were used and these include exclusion, role...


Variations of the Morse-Hedlund Theorem for k-Abelian Equivalence

In this paper we investigate local-to-global phenomena for a new family of complexity functions of infinite words indexed by k ∈ N1∪{+∞} where N1 denotes the set of positive integers. Two finite words u and v in A∗ are said to be k-abelian equivalent if for all x ∈ A∗ of length less than or equal to k, the number of occurrences of x in u is equal to the number of occurrences of x in v. This def...


Publication date: 2008
